On bootstrap based variance estimation under fine stratification

The primary focus of any sample survey is to provide point estimates for the parameters of primary interest, together with estimates of the variance of those point estimates to quantify their uncertainty. Larger samples and better measurement tools can help to reduce that uncertainty. Numerous effective stratification criteria may be used in a survey to reduce the variance within strata. In a fine stratification design, the population is divided into numerous small strata, each containing a small number of sampling units, such as one or two. This is done to ensure that certain characteristics or subgroups of the population are well represented in the sample. With many strata, however, the sample size within each stratum can become small, potentially resulting in larger errors and less stable estimates. Variance estimation becomes difficult when there is only one unit per stratum. In that case, the collapsed stratum technique is the classical method for estimating the variance. This method, however, is biased and overestimates the variance. This paper proposes a bootstrap-based variance estimator for the population total under fine stratification, which overcomes the drawbacks of previously explored estimation approaches. The estimator's properties are also investigated. A simulation study and a practical application to data from the Survey of Mental Health Organizations were carried out to investigate the properties of the proposed estimator. The results show that the proposed estimator performs well.


Introduction
In a sample survey, only the individuals in the sample are observed, rather than the entire population, for the purpose of estimating population characteristics. The sample characteristics are used to approximate those of the population. The inaccuracy in this approximation is known as sampling error, and it is inherent and unavoidable in all sampling designs. Nonetheless, when time and cost are considered, sampling yields considerable savings, both in observing the characteristics and in the subsequent handling of the data.
A variety of sample selection designs are available, and careful selection will provide accurate and dependable estimates. Rough estimates of the sample size n can be derived for each sampling strategy with the necessary degree of precision. The requirement for reliable estimates, generally from very small samples with limited survey resources, along with the various framing and sampling procedures, leads to a complex survey design that uses different sampling techniques. For units drawn under a complex survey sampling design, the observed values of the variable of interest are neither independent nor identically distributed. In addition, survey processing strives to improve the quality and usability of survey data by eliminating estimation bias and meeting confidentiality rules, which further raises the survey's complexity [1]; examples can be found in [2,3].
The fundamental purpose of all sample surveys is to obtain point estimates for the parameters of primary interest and to estimate the variance associated with those point estimates to quantify the uncertainty. Variance estimators and the related standard errors matter because the estimated variance of an estimator is the most critical measure of its quality.
Estimating the sampling variance can be extremely difficult because of the complex sample design, non-linear estimators, and survey processing effects [4]. Simple and exact analytic formulae for variance estimation under various sample designs are given in [5]. No closed-form formulae for estimating variances exist when sample designs are more complicated or deployed in multiple phases. Furthermore, sophisticated weighting mechanisms make the variance estimation formula challenging even for simple statistics like totals, and even with simple sample designs. When there is no exact technique for computing unbiased estimates of the standard errors of point estimates, the only choice is to approximate the required quantities. The alternatives are replication techniques, or analytic techniques that apply simplifying assumptions concerning the sample design or the statistic whose variance is to be estimated [6].
In a stratified sample design, the target population is divided into a finite number of subpopulations (strata) of homogeneous units that share at least one common trait, such as age, sex, educational or income level, geographical area, or economic status. Because the strata are defined to be internally homogeneous, stratification reduces the total variance of the estimator of the parameter of interest. Furthermore, because strata are assumed to be independent, stratification allows a flexible choice of sampling technique per stratum, for example, simple random sampling (SRS) with or without replacement, or systematic sampling. Because the samples in different strata are independent, each estimate and its related variance estimator are just the sums of the corresponding estimators within each stratum. As a result, the difficulty of finding the proper variance estimator for a stratified single-stage sample is reduced to the problem of determining the appropriate variance estimator for each stratum's sampling design [7][8][9]. This study focuses on the specific issue of fine stratification designs, in which the sample size per stratum is small, namely $n_i = 1$ or $n_i = 2$ primary sampling units (PSUs) selected using SRS without replacement.
Variance estimation becomes difficult when there is only one unit per stratum. This scenario may arise when the stratification is very fine, when each stratum has a sample size greater than one but only one unit responds, or when the sampling design itself imposes a single unit per stratum. For example, as described in [10], the new Canadian Health Measures Survey (CHMS) samples just a single PSU in one of its five strata, although CHMS estimates are required at the national level. Having a stratum with a single PSU is a fairly common problem. When there is only one PSU within a stratum, there is insufficient information with which to compute an estimate of that stratum's variance. Some of the suggested solutions and their corresponding drawbacks are detailed and discussed in [9,11-14].
The variance for two or three PSUs per stratum, on the other hand, is significant. A key technique for estimating the variance of an estimator of an unknown parameter under fine stratification is to collapse neighboring strata to form pseudo-strata with a larger number of PSUs and then estimate the variance within these pseudo-strata.
The method was first introduced in [15], although it frequently overestimates the estimator's variance. The collapsed stratum technique is the most commonly proposed strategy in the literature for dealing with this problem. The topic of collapsing strata for variance estimation with one unit per stratum is covered in [3,5,16,17]. In either of these circumstances, it is challenging to calculate variances directly using one sampled unit per stratum. According to [15,18], auxiliary variables that are well correlated with the expected values of the stratum means are recommended to minimize the bias of the variance estimator. Unfortunately, this type of useful auxiliary information may not be easily accessible for all strata. Mantel and his coauthor [10] proposed a new technique for the CHMS based on variance factors from distinct sampling stages. Mosaferi [19] developed empirical Bayes and constrained empirical Bayes estimators of the variance under a one-unit-per-stratum sampling design. The author also compared the one-PSU-per-stratum design with the two-PSUs-per-stratum design, highlighting some inconsistencies of the proposed estimators that stem from the method-of-moments parameter estimation approach.
The collapsed stratum technique is the most commonly used strategy in the literature for dealing with this problem, and it results in a positive bias, as discussed in Section 2. By replacing the collapsed strata with a kernel-weighted stratum neighborhood and utilizing deviations from a fitted mean function, a nonparametric kernel-based technique for estimating the variance was developed in [9]. The authors demonstrated the superiority of their method over the collapsed stratum variance estimator using the United States Consumer Expenditure Survey. A major weakness of nonparametric kernel-based regression over a finite range is the bias at boundary points. The bias of the estimators towards the boundary points decreases at the expense of increasing variance. The trade-off between bias and variance has thus remained an issue. Most of the proposed alternatives to the collapsed stratum variance estimator are based on concomitant or auxiliary information; however, this type of desirable auxiliary information may not be readily available for all strata.
Fine stratification is a popular design because it allows the population to be divided into numerous small strata, each containing a relatively small number of sampling units. This is done to ensure that certain characteristics or subgroups of the population are well represented in the sample, and it leads to more precise estimates for specific subgroups. Examples include the Current Population Survey and the National Crime Victimization Survey, both conducted by the U.S. Census Bureau, and the National Survey of Family Growth, conducted by the University of Michigan's Institute for Social Research. Fine stratification has proved useful in many applications because its point estimator is unbiased and efficient. In such designs, however, traditional variance estimation techniques may not perform well because of the limited number of observations in some strata.
This work proposes a bootstrap-based variance estimator for the population total under fine stratification as an additional method to overcome the inadequacies of previously explored estimation approaches. The method involves repeatedly resampling from the original sample with replacement to create multiple pseudo-samples. The point estimator of interest (specifically the total) is computed for each pseudo-sample. The variance of the point estimator is then calculated from the variability among these pseudo-estimates. The new method is detailed in Section 4.
The paper is structured as follows. Section 2 presents the collapsing technique for variance estimation for the population total, and Section 3 presents nonparametric kernel-based variance estimation for the population total. Section 4 develops the bootstrap-based variance estimator and its properties. Section 5 provides an empirical assessment of the findings, and Section 6 concludes.

Collapsing strata technique for variance estimation
When just one PSU is chosen per stratum, or when only one PSU in a stratum participates, strata or PSUs are combined to generate pseudo or analytic strata for variance estimation. The number of PSUs in some sample designs can be extremely large, especially in education and establishment surveys, where there can be thousands of first-stage units. In such cases, PSUs, strata, or both may be collapsed together.
Let the population total $t = \sum_{i=1}^{H} t_i$ be estimated by $\hat{t} = \sum_{i=1}^{H} \hat{t}_i = \sum_{i=1}^{H} y_k/\pi_k$, where $\hat{t}_i$ is an unbiased estimator of the stratum total $t_i$. Assuming a single element $k$ is selected from the stratum with inclusion probability $\pi_k$, the $\pi_k$ add to unity within the stratum. In particular, $\pi_k = 1/N_i$ for all $k$ if simple random selection is used. After pairing the strata, let $i$ and $j$ index the two strata in a pair, with $i = 1, \ldots, H$ and $j = 1, \ldots, H$. We suppose that the value of the study variable $y_k$ is observed without error for each unit $k$ in the sample $s$. Our goal is to estimate the population total $t$. Define the indicator variable $I_k = 1$ if $k \in s$ and $I_k = 0$ otherwise. If the second-order inclusion probability $\pi_{kl} > 0$ for all $\{k, l\} \in U$, the design is said to be measurable, and the design variance admits an unbiased estimator, as discussed in [20][21][22]. The estimator in (2) is unbiased for $t$, and its variance is expressed in terms of the stratum-level variances $V_i$. In [3,9,16] the collapsed stratum variance estimator is given by (5), where $c_j(i) = 1$ if strata $i$ and $j$, $i \neq j$, belong to the same collapsed stratum and $c_j(i) = 0$ otherwise. The variance estimator in (5) is design-biased, with design bias given in (6). As shown in (6), the estimator in (5) has a positive bias, and the bias is small if the strata are successfully matched, in the sense that $t_i \approx t_j$ whenever $c_j(i) = 1$. To retain the statistical properties, the pairing must be carried out without reference to the sample data. Populations that use fine stratification often have a temporal, geographical, or other structure that may be exploited in pairing to minimize the difference $t_i - t_j$ [6].
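To make the role of the pairing concrete, the following minimal sketch computes a collapsed stratum variance estimate in the common textbook form, in which the squared deviations of the estimated stratum totals within each pseudo-stratum are summed with the factor L/(L-1). This form is an assumption chosen for illustration and is not claimed to reproduce Eq (5) term by term; the function name `collapsed_stratum_variance` and the numerical values are hypothetical.

```python
def collapsed_stratum_variance(t_hat, groups):
    """Textbook collapsed stratum variance estimate (illustrative sketch).

    t_hat  : list of estimated stratum totals, one per original stratum
    groups : list of lists of stratum indices; each inner list is one
             collapsed (pseudo) stratum with at least two members
    """
    v = 0.0
    for g in groups:
        L = len(g)
        if L < 2:
            raise ValueError("each collapsed stratum needs at least two strata")
        mean_g = sum(t_hat[i] for i in g) / L
        # L/(L-1) times the sum of squared deviations from the group mean;
        # for a pair this reduces to (t_hat[a] - t_hat[b]) ** 2
        v += L / (L - 1) * sum((t_hat[i] - mean_g) ** 2 for i in g)
    return v


# Hypothetical estimated stratum totals for H = 4 strata
t_hat = [10.0, 12.0, 20.0, 26.0]

# Well-matched pairs (similar totals) give a small estimate ...
print(collapsed_stratum_variance(t_hat, [[0, 1], [2, 3]]))  # 40.0
# ... while poorly matched pairs inflate it
print(collapsed_stratum_variance(t_hat, [[0, 2], [1, 3]]))  # 296.0
```

The two calls use the same stratum totals and differ only in the pairing, which illustrates why the positive bias is small when paired strata have similar totals.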

Nonparametric kernel based variance estimation
To reduce the bias in (5), an alternative method was introduced in [9], in which the binary function $c_j(i)$ in Eq (5) is replaced by the kernel weights $d_j(i)$ defined in Section 1.3 of [9]. Eq (7) gives the resulting nonparametric variance estimator, an alternative to the collapsed variance estimator under fine stratification. The design expectation of the estimator (7) involves a nonrandom normalizing constant, $c_d$, which depends on the kernel weights but not on the survey variables. The estimator (7) also has a positive bias; $c_d$ is therefore chosen to reduce the part of the bias due to the $V_i$'s when the $V_i$'s are constant across strata.
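As a rough illustration of how kernel weights replace the binary pairing indicator, the sketch below computes normalized Epanechnikov weights for stratum $i$ over stratum-level covariates $x_j = j/H$. The exact weights of [9] are not reproduced here: the normalization over all strata in the window (including $j = i$) and the function names are assumptions made for this example.

```python
def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_weights(i, H, h):
    """Normalized kernel weights for stratum index i (0-based).

    Assumes stratum covariates x_j = j / H for j = 1, ..., H and
    normalizes the raw kernel values so that the weights sum to one.
    """
    x = [(j + 1) / H for j in range(H)]
    raw = [epanechnikov((x[i] - xj) / h) for xj in x]
    total = sum(raw)
    return [r / total for r in raw]


H, h = 50, 0.03  # bandwidth chosen so that 1/H < h < 2/H
interior = kernel_weights(10, H, h)
boundary = kernel_weights(0, H, h)
print(sum(1 for w in interior if w > 0))  # 3 strata in the window
print(sum(1 for w in boundary if w > 0))  # 2 strata in the window
```

With a bandwidth between 1/H and 2/H, an interior stratum borrows from its two immediate neighbours, whereas a boundary stratum has only one neighbour, which is one way to see where the boundary problem of kernel smoothers comes from.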

Bootstrap-based variance estimator
The collapsed variance estimator defined in Eq (5) has a nonnegative bias. Its alternative, the nonparametric kernel-based estimator defined in Eq (7), also suffers from boundary bias: most kernel smoothers have boundary problems and require modification at the boundary points. That is, towards the boundary points the bias of the estimators decreases at the cost of an increasing variance. It is assumed that, for any collapsing, the contribution to the bias of the variance estimator from each pair of strata is known and nonnegative. We therefore develop a bootstrap-based variance estimator $V_{boot}$ for a given set of $H$ strata, which are paired so as to reduce the bias of the variance estimator at no cost to the variance. When bootstrap procedures are applied, a stratum with a single unit can lead to a variety of issues.
The guiding principle in applying the bootstrap to stratified sampling designs is that each bootstrap replicate should itself be a stratified sample selected from the parent sample. In this paper, however, the parent sample usually has only one or two elements per stratum, which makes resampling within strata uninformative. We therefore combine the single unit in each such stratum with the next smallest stratum to create pseudo-strata with at least two units before applying the bootstrap. The bootstrap is applied after collapsing strata with approximately similar characteristics; this collapsing is the source of the bias. Bootstrap sampling is applied to both the collapsed and the non-collapsed strata by selecting a sample of size $n^* = 2$ in each stratum. The bootstrap bias corrector defined in Eq (13) is then used to reduce the bias for the collapsed strata.
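The resampling step described above can be sketched as follows. This is a naive with-replacement bootstrap within pseudo-strata, given only to illustrate the mechanics; it does not include the bias corrector of Eq (13) and is not the estimator $V_{boot}$ itself. The function name, the input layout (one list of design-weighted values $y_k/\pi_k$ per pseudo-stratum), and the numbers are assumptions.

```python
import random
import statistics


def bootstrap_total_variance(pseudo_strata, B=1000, n_star=2, seed=0):
    """Naive stratified bootstrap variance of an estimated total (sketch).

    pseudo_strata : list of lists; each inner list holds the weighted
                    values y_k / pi_k of the units in one pseudo-stratum
    B             : number of bootstrap replicates
    n_star        : bootstrap sample size drawn in every pseudo-stratum
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(B):
        total = 0.0
        for units in pseudo_strata:
            # draw n_star units with replacement inside the pseudo-stratum
            draw = [rng.choice(units) for _ in range(n_star)]
            # rescale so the replicate total is centred on the stratum total
            total += len(units) / n_star * sum(draw)
        totals.append(total)
    # variability among the B replicate totals
    return statistics.variance(totals)


# Hypothetical pseudo-strata, each formed from two one-unit strata
sample = [[120.0, 135.0], [80.0, 95.0], [200.0, 180.0]]
print(bootstrap_total_variance(sample, B=1000))
```

Fixing the seed makes the replicate totals reproducible, which is convenient when the same resamples are to be reused for several variables of interest.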
From (2), we define the bootstrap estimator of the population total $\hat{t}_b = \sum_{i=1}^{H} y_k^*/\pi_k$ using the replicate values $y_k^*$ in stratum population $U_i$. For a given $H$, the bootstrap estimator $\hat{t}_b$ is computed over all $B$ resamples across the strata. The bias of an estimator $\hat{t}_{bj}$ is defined in the usual way, and a bootstrap-based approximation to this bias is obtained from the bootstrap copies of $\hat{t}_{bj}$. This construction follows standard bootstrap thinking, in which the population is replaced by the empirical population of the sample. The bootstrap bias corrector is defined in (13). In (7), we then replace the weights $d_j(i)$ by the bootstrap bias corrector defined in (13); the resulting bootstrap variance estimator under fine stratification is given in (14), where $\hat{a}_{bj}$ is the bootstrap bias corrector defined in (13) and $c_b$ is a nonrandom normalizing constant that depends on the bootstrap bias corrector.
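The bootstrap approximation to a bias can be illustrated with the standard textbook construction, in which the average of the bootstrap copies minus the original estimate approximates the bias of the estimator. The sketch below shows that generic construction only; the function names are hypothetical, and the specific corrector $\hat{a}_{bj}$ of Eq (13) may differ in form.

```python
def bootstrap_bias(estimate, replicates):
    """Standard bootstrap bias approximation: mean of copies minus estimate."""
    return sum(replicates) / len(replicates) - estimate


def bias_corrected(estimate, replicates):
    """Estimate with the approximated bias subtracted."""
    return estimate - bootstrap_bias(estimate, replicates)


# Hypothetical estimate and three bootstrap copies of it
estimate = 10.0
copies = [11.0, 12.0, 13.0]
print(bootstrap_bias(estimate, copies))   # 2.0
print(bias_corrected(estimate, copies))   # 8.0
```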

Properties of the proposed estimator
The bootstrap variance estimator is judged by its design expectation, design variance, and mean squared error under a specific sampling design for the fixed finite population. We are therefore interested in the statistical properties of the estimator with respect to the sampling design. For the design expectation of $V_{boot}$ and its derivation, see Appendix A in S1 File.

The variance of the estimator. The design variance of $V_{boot}$ is derived in Appendix B in S1 File.

Mean squared error of the estimator. The design mean squared error of our estimator is expressed in terms of its bias and variance. For simplicity, note that $V_{boot}$ targets $\mathrm{Var}(\hat{t})$, and the bias of $V_{boot}$ is measured relative to it. Accordingly, the design mean squared error of our estimator combines this bias with the design variance. The MSE can be used directly for efficiency comparisons, since it incorporates both the variance and the bias of an estimator. Comparing Eqs (17) and (18) shows that the two expressions are approximately equal; thus the bias of our estimator is expected to be small.
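For reference, the relation between bias, variance, and mean squared error used in this comparison is the standard decomposition, written here for a variance estimator that targets $\mathrm{Var}(\hat{t})$. This is the generic identity, stated under the assumption that expectations are taken with respect to the sampling design; it is not a reconstruction of the paper's numbered equations.

```latex
\begin{align*}
\mathrm{Bias}(V_{boot}) &= E(V_{boot}) - \mathrm{Var}(\hat{t}), \\
\mathrm{MSE}(V_{boot})  &= E\!\left[\left(V_{boot} - \mathrm{Var}(\hat{t})\right)^{2}\right]
                         = \mathrm{Var}(V_{boot}) + \left[\mathrm{Bias}(V_{boot})\right]^{2}.
\end{align*}
```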

Unconditional simulation study
In the simulation, we investigate the behavior of the bootstrap-based variance estimator compared with the collapsed variance estimator and the nonparametric kernel-based variance estimator. These are evaluated under fine stratification for different numbers of strata, bandwidths, and error standard deviations. For the nonparametric kernel-based variance estimator in (7), the Epanechnikov kernel function is used, and bandwidths are chosen as 1/H < h < 2/H to yield the smallest possible nonempty kernel window; accordingly, h = {0.025, 0.015, 0.045, 0.0055, 0.0075} were considered (see [9] for a detailed discussion). The population $x_k$ is generated as a set of independent and identically distributed uniform (0, 1) random variables. A stratified finite population was created with eight survey variables of interest and $H$ evenly sized strata of size $N_i = N/H$, with $x_i = i/H$ for stratum $i$. In this simulation, 1000 bootstrap samples were used to assess the estimator's quality. The simulated data were stratified so that each stratum has two primary sampling units, and the variance was then evaluated using the three variance estimators specified by Eqs (5), (7) and (14). Three values of the error standard deviation were considered: σ = 0, σ = 0.25, and σ = 0.5. For each of the seven variables, the population size is N = 3000. Simple random sampling without replacement is used to create samples with H = 50, H = 100, and H = 200 strata; because fine stratification allows deep stratification, a large number of strata were created, and in all cases we consider 30 strata to be collapsed.
Increasing the sample size has the same effect as lowering the standard deviation. Because the population is kept fixed across the 1000 bootstrap samples, the design-averaged performance of the estimators can be evaluated. The design bias, variance, and mean squared error were calculated, and mean functions were assumed for the eight variables of interest. For each of the first seven mean functions, the minimum is zero and the maximum is two. In all cases, the population values $y_{\ell k}$ $(\ell = 1, \ldots, 7)$ are generated from the mean functions by adding i.i.d. $N(0, \sigma^2)$ errors, that is, $y_{\ell k} = \mu_\ell(x_k) + \varepsilon_k$ with $\varepsilon_k \sim N(0, \sigma^2)$, so that the total is $t_\ell = \sum_{k \in U} y_{\ell k}$. The mean functions are labelled linear ($\mu_1$), quadratic ($\mu_2$), bump ($\mu_3$), jump ($\mu_4$), exponential ($\mu_5$), Cycle1 ($\mu_6$), and Cycle4 ($\mu_7$). These means reflect a range of correct and incorrect model specifications for the estimators under evaluation [22]. Because the assumed linear model is correctly specified for $\mu_1$, this is the most favourable case. It is therefore of interest to assess how much efficiency is lost by assuming only a smooth model when the underlying model is linear. The remaining mean functions deviate from the linear model in various ways. The trend for $\mu_2$ is quadratic, so an assumed linear model would be incorrect over the whole range of $x_k$ but appropriate locally. The function $\mu_3$ is linear across most of its range, except for a bump over a small portion of the range of $x_k$. The mean function $\mu_4$ is not smooth. The function $\mu_6$ is a sinusoid that completes one full cycle on [0, 1], whereas $\mu_7$ completes four full cycles, and $\mu_5$ is exponential, as discussed in [8,9,23].
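The population-generation step can be sketched as follows. The exact mean functions used in the study are not reproduced here: `linear` and `cycle1` below are hypothetical stand-ins chosen only because they range from zero to two on [0, 1], and assigning every unit in stratum $i$ the covariate $x_i = i/H$ is a simplifying assumption of this example.

```python
import math
import random


def linear(x):
    """Illustrative linear mean function with range [0, 2] on [0, 1]."""
    return 2.0 * x


def cycle1(x):
    """Illustrative sinusoid completing one full cycle on [0, 1]."""
    return 1.0 + math.sin(2.0 * math.pi * x)


def make_population(N, H, mean_fn, sigma, seed=0):
    """Return H strata of N // H values y = mean_fn(x_i) + N(0, sigma^2) error."""
    rng = random.Random(seed)
    size = N // H
    population = []
    for i in range(1, H + 1):
        x_i = i / H  # stratum-level covariate
        stratum = [mean_fn(x_i) + rng.gauss(0.0, sigma) for _ in range(size)]
        population.append(stratum)
    return population


def population_total(population):
    """Population total of the survey variable over all strata."""
    return sum(sum(stratum) for stratum in population)


pop = make_population(N=3000, H=100, mean_fn=cycle1, sigma=0.25)
print(len(pop), len(pop[0]), population_total(pop))
```

Setting `sigma=0.0` reproduces the noise-free case in which every unit equals its stratum mean, so that any nonzero variance estimate is pure bias.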
When $V_i = 0$, meaning that $\mathrm{Var}(\hat{t}) = 0$, Table 1 shows the exact biases of the variance estimators. The conclusions described here apply to any design, because the $t_i$ values, the kernel, and $H$ are the primary determinants of the variance estimators' expectation and bias in this case. Compared with the collapsed stratum variance estimator and the nonparametric kernel-based variance estimator, the proposed bootstrap-based variance estimator has a substantially smaller bias for each response variable. Table 2 compares the bias of the three estimators for error standard deviations other than zero.
Table 3 compares the RMSE of the nonparametric kernel-based variance estimator $V_{ker}$ and the bootstrap-based variance estimator $V_{boot}$.
The proposed bootstrap variance estimator has a smaller RMSE than the collapsed stratum variance estimator in every scenario studied, and frequently a much smaller one. At each value of $H$, $V_{boot}$ outperforms $V_{col}$ because it has a smaller bias; for larger numbers of strata, the variability of the two estimators is essentially comparable. Table 4 compares the bias of the nonparametric kernel-based variance estimator $V_{ker}$ and the bootstrap-based variance estimator $V_{boot}$. The simulation results illustrate the difference in bias between the two variance estimators. The findings reinforce the preference for $V_{boot}$, especially for larger numbers of strata $H$.

Conditional simulation study
To examine how the performance of the variance estimators depends on $\bar{x}$, we sorted the 1000 bootstrap samples from each population by increasing value of $\bar{x}$. We then grouped the samples into 50 sets of 20, so that the first set contains the 20 samples with the smallest $\bar{x}$, the next set contains the samples with the next 20 smallest values of $\bar{x}$, and so on. For each of these 50 sets, we calculated the average value of $\bar{x}$, the conditional root mean squared error (CRMSE), and the average of the variance estimates for each of the variance estimators. The values of the CRMSE and the conditional biases were then plotted against the average values of $\bar{x}$. Figs 1-14 show that the new estimator has a small RMSE and small biases in almost every scenario considered.
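The grouping procedure described above can be sketched as follows, assuming that each sample contributes one value of $\bar{x}$ and one variance estimate and that the true variance is known from the simulation design. The function name `conditional_summary` and the toy inputs are hypothetical.

```python
import math


def conditional_summary(xbar, v_hat, true_var, group_size=20):
    """Sort samples by xbar, group them, and summarize each group.

    Returns one tuple per group:
    (average xbar, conditional bias, conditional RMSE).
    """
    order = sorted(range(len(xbar)), key=lambda s: xbar[s])
    rows = []
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        m = len(idx)
        avg_x = sum(xbar[s] for s in idx) / m
        bias = sum(v_hat[s] for s in idx) / m - true_var
        crmse = math.sqrt(sum((v_hat[s] - true_var) ** 2 for s in idx) / m)
        rows.append((avg_x, bias, crmse))
    return rows


# Toy inputs: 1000 samples, constant variance estimate of 2.0, true value 1.5
xbar = [(999 - s) / 1000 for s in range(1000)]
rows = conditional_summary(xbar, [2.0] * 1000, true_var=1.5)
print(len(rows))   # 50 groups of 20 samples
print(rows[0])     # summary for the 20 samples with the smallest xbar
```

Plotting the second and third entries of each row against the first gives the conditional bias and CRMSE curves.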
Different numbers of strata were considered for all mean functions when deriving the biases. Under the same conditions, the new estimator clearly has a smaller bias than the estimators favoured in current practice. Comparing the $V_{ker}$ and $V_{col}$ estimators, the bias of $V_{ker}$ is substantially smaller than the bias of $V_{col}$ for every response variable except the quadratic, jump, and exponential ones; for the jump, smoothing over the discontinuity is the incorrect strategy. In practice, this could readily be addressed for both the collapsed and kernel variance estimators by splitting the sample at the discontinuity, calculating two variance estimators, and then combining them. The RMSE of $V_{ker}$ is small compared with the RMSE of $V_{col}$, except for the jump.

Data application
The performance of the developed estimator is evaluated using the 1998 Survey of Mental Health Organizations (SMHO) data, available in the PracTools package in R (https://rdrr.io/cran/PracTools/man/smho98.html). The Substance Abuse and Mental Health Services Administration in the United [...] divided into 16 strata. After eliminating outpatient facilities, only organizations with a number of beds greater than zero remained. The strata pairs {12, 13}, {10, 11}, {6, 8}, and {4, 5} were then collapsed because few PSUs were left in them after the exclusion step. As a result, we built eight new strata to estimate the variance. The number of beds (total inpatient beds) and EXPTOTAL (total expenditures) were taken as the variables of interest, and strata were ordered by $x_i$ = log of total beds in stratum $i$. After collapsing the aforementioned strata, simple random sampling without replacement was used to select two PSUs per stratum ($n_i = 2$), and the variances were estimated using the three variance estimators specified in the manuscript. We then examined the coefficient of variation (CV) and the root mean squared error (RMSE); the findings for each estimator are shown in Table 5 and in Figs 15 and 16.
The bootstrap-based variance estimator has the lowest CV and RMSE values, as well as the smallest bias, among the three estimators.

Concluding remarks
A bootstrap-based variance estimator has been developed as an alternative to the collapsed variance estimator and the nonparametric kernel-based variance estimator. These approaches are currently applied under fine stratification and are frequently used in survey statistics research. Fine stratification has proved useful in many applications because its point estimator is unbiased and efficient. Common practice for estimating the variance in this context is either to collapse adjacent strata into pseudo-strata and then estimate the variance, or to use a nonparametric kernel-based variance estimator. Neither of the resulting variance estimators is design-unbiased, the bias increases as the population means of the pseudo-strata become more variable, and these estimators may suffer from a large root mean squared error (RMSE). A number of alternative variance estimators have been proposed in the literature, but they often rely on strong auxiliary variables that are well correlated with the response variable, or they have a complex form, which makes them inapplicable in practice. This paper proposes a viable solution to this long-standing problem: a bootstrap-based variance estimator that replaces the pseudo-strata indicator and the kernel weights by a bootstrap bias corrector. Its properties have been derived, and the simulation study and the real data application show that the new estimator performs well in each case considered. It has a small root mean squared error compared with the current estimators under the same conditions. The proposed approach provides variance estimates that appropriately account for the complexities of the sampling design and the specific characteristics of interest within the population. It leads to more accurate and precise statistical inference for complex survey data than existing approaches.